Autonomous investigation timed out
Monitor: View in Axiom
Background Job: update_autonomous_timedout (in the backend service)
Overview
This monitor detects when an autonomous investigation has been inactive for over one hour, meaning no new steps have been completed within that period.
The update_autonomous_timedout background job runs in the backend service every hour. It checks for investigations whose most recent step finished more than one hour ago. If any are found, it logs an "Autonomous investigation timed out" message and triggers this monitor.
Note: An investigation can legitimately take several hours as long as new steps are being completed periodically. This monitor only fires if progress completely stops for over an hour.
Warning: This condition should never happen under normal circumstances. A triggered monitor always indicates a bug in the code (e.g., an unhandled exception, an async deadlock, or a missing exit condition), not a customer configuration issue.
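Since the background job runs hourly, it can help to first gauge how often the monitor has fired recently, to see whether you are dealing with one stuck investigation or several. An optional query sketch (the 7-day window and hourly bucketing are illustrative choices, not part of the monitor itself):
logs
| where body == "Autonomous investigation timed out" and ['resource.deployment.environment'] == "production"
| where _time > ago(7d)
| summarize count() by bin(_time, 1h)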
Step 1: Identify the Stuck Investigation
Run the following query in Axiom to find all investigations that timed out:
logs
| where body == "Autonomous investigation timed out" and ['resource.deployment.environment'] == "production"
| project _time, org_name, ['attributes.customer_id'], ['attributes.session_id'], ['attributes.type'], last_event_time = unixtime_milliseconds_todatetime(['attributes.last_event_time'])
From the results, copy the attributes.session_id values; each corresponds to a timed-out investigation.
You'll use these session IDs in the next steps to analyze logs, traces, and relevant code paths.
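If the monitor fired some time ago or the results are noisy, a narrowed variant of the query above can help surface the most recent timeouts first (the 24-hour window is illustrative, not required):
logs
| where body == "Autonomous investigation timed out" and ['resource.deployment.environment'] == "production"
| where _time > ago(24h)
| order by _time desc
| project _time, org_name, ['attributes.session_id']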
Step 2: Investigate the Session in Logs, Traces, and Code
For each session_id found in Step 1, investigate the related logs and traces to understand where the investigation got stuck.
Start with WARN and ERROR logs
logs
| where * contains "RELEVANT_SESSION_ID"
| where severity_text == "WARN" or severity_text == "ERROR"
Then check traces for failed spans:
traces
| where * contains "RELEVANT_SESSION_ID"
| where ['status.code'] == "ERROR"
- Look for the last successful action before the investigation stopped progressing.
- Identify any exceptions, trace errors, or timeouts that might indicate where execution stalled (see the span-duration query below).
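To help spot where execution stalled, you can also look at the longest-running spans for the session. A minimal sketch, assuming the standard Axiom OpenTelemetry trace fields name and duration are present in the traces dataset:
traces
| where * contains "RELEVANT_SESSION_ID"
| project _time, name, duration, ['status.code']
| order by duration desc
| take 20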
If you don't find any relevant WARN or ERROR logs, expand your search to include INFO logs; these can help identify the final successful step before the timeout.
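A simple way to do this is to pull every log line for the session in reverse time order, so the final completed step appears first (field names are the same ones used in the queries above; the row limit is illustrative):
logs
| where * contains "RELEVANT_SESSION_ID"
| order by _time desc
| project _time, severity_text, body
| take 100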
Once you identify the last log or trace entry before the stall, open the corresponding code in the backend or investigation-worker repositories.
Search for the log message or function name, and follow the logic in the code to understand:
- What operation was being performed.
- Which async function, task, or dependency it was waiting on.
- Whether there are missing error handlers, timeouts, or termination conditions.
Step 3: Check AWS ECS / CloudWatch (If Logs Are Missing)
Sometimes Axiom may not contain all relevant logs or traces (e.g., when the backend process crashes or hangs before flushing).
If this happens, check AWS CloudWatch Logs for the relevant logs/traces.
See how to do it HERE